Abstract

In this study we introduce and test several methods to reduce the computational cost in
dynamic programming algorithms for isolated word recognition systems. Three methods will be
discussed in detail: 1.) Pruning by preset thresholds 2.) Search based on the Branch and Bound
technique 3.) Branch and Bound based search with additional pruning. Compared to
conventional algorithms, Method 3.) could be seen to yield a speed up of approximately a factor
of 5 at no loss of recognition accuracy. The branch and bound method with pruning is also
ideally suited for research oriented systems, since pruning is independent of the parametrization
used (eliminates the necessity for retuning thresholds). Additional features of this method, which
are of importance to maintaining the flexibility and diagnosticity needed for such a system, will
be discussed.

1. Introduction

For the development of practical speech recognition systems, computation speed is one of the
predominant design factors. Several commercially available systems still employ linear time
normalization techniques, which are inferior in terms of recognition accuracy, to account for
speaking rate variations, since the dynamic programming (DP) technique is computationally very
costly. Even in a research environment, the turn-around time for larger experimental runs over
large speech data bases can easily be in the order of days or weeks. Consequently, several methods
have been employed to reduce the redundancies in isolated word recognition systems. Referring to
the commonly used DP-matching techniques, as used by Sakoe and Chiba, Itakura, Rabiner and
others, it can be seen that the bottleneck of nonlinear time normalization is given by the
number of points within a matrix (defined by the frames of an unknown utterance x and a
known reference utterance y) that are needed to find an optimum matching path. The
computation needed for each of these points includes the computation of a distance between the
particular test frame and reference frame under consideration and the derivation of a cumulative
score defined by the constraints of the DP-matching algorithm in use. In a computer program
that performs DP-matching, these operations will typically constitute the innermost loop and
therefore be the most repetitious and most expensive in time. Finding less expensive warping
constraints or distance functions, however, will in most cases yield a loss in recognition accuracy.
Two other methods have been used by Itakura et al. and by Rabiner. One is the
definition of a window around the diagonal of the warping matrix that defines the boundaries of
any allowable warping path. This definition is not only useful but also, for some warping
functions, needed to prohibit possible non-linguistic paths through the matrix. Reduction of the
width of this window thus increases computational speed significantly. It has been shown that a
window that restricts the warp search path to lead or lag behind a linearly time-normalized match
by not more than 50 msec is the optimal choice for an isolated word recognition system using an
alpha-digit vocabulary. Such a window constraint was seen not only to provide a computational
saving of up to 70% but also, in some cases, to increase recognition accuracy.
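
As a rough illustration of how such a window shrinks the search space, the sketch below counts
the grid points that fall inside a band around the linearly time-normalized diagonal. The
function name, the 40-frame utterance lengths and the tolerance of 5 frames are illustrative
assumptions, and the sketch does not model the additional slope constraints of any particular
warping function.

```python
def window_points(n_test, n_ref, t):
    """Count grid points (i, j) with |j - i * n_ref / n_test| <= t."""
    count = 0
    for i in range(n_test):
        center = i * n_ref / n_test  # position of test frame i on the linear match
        for j in range(n_ref):
            if abs(j - center) <= t:
                count += 1
    return count


# Hypothetical utterance lengths; t is the window tolerance in frames.
full = 40 * 40
banded = window_points(40, 40, 5)
print("full grid:", full, "inside window:", banded)
print("fraction of the full grid saved:", 1 - banded / full)
```

The saving computed this way is relative to the full rectangular grid, so it is not directly
comparable to a saving measured against an already constrained search space.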


It should be noted that, in that paper, the window was not mainly introduced for efficiency reasons.

Further methods have been suggested to increase computational efficiency. In the following
chapter we will briefly describe a method suggested by Rabiner et al. and then introduce two
alternate methods. In the third chapter we will report the results of extensive testing on all
methods reported here.

2. Efficient Algorithms for Non-linear Time Warping

In this chapter we will describe three methods currently in use in our isolated word recognition
system to perform dynamic programming in an efficient manner. Most methods are based on
the idea that, analogous to the presumed strategy of human perception, selection of the correct
candidate out of a reference vocabulary can be performed in an anticipatory way, by process of
elimination. In other words, particularly inappropriate candidates can be discarded comparatively
early in the matching process, i.e., the match can be aborted.

2.1  Preset  Thresholds 

In this way, Rabiner et al. have obtained significant reductions in computation cost. Two
thresholds are predefined, denoted Tmin and Tslope. The computation of the warp is performed
by computing the distances and the Itakura warping function between a given frame i in the test
token and a column (specified by the search space) of reference frames (see Fig.2-1). For each of
these grid points a cumulative dissimilarity score of the best path leading to this point is obtained
in this fashion. The minimum score out of these cumulative scores ("localmin") is determined
and compared to the threshold Ti.

If localmin > Ti the warp is aborted and recognition proceeds to the next candidate; Ti is given
by

Ti = (Tmin + i Tslope) N

where N is the number of frames in the test utterance. Referring to Fig.2-2, it can be seen that
Tslope can be viewed as N times the average distance that can be added to the cumulative score
along the search path without causing the pruning mechanism to abort the match. The factor N
provides a further adjustment depending on utterance length. Both Tmin and Tslope have to be
set in such a fashion that they minimize computation (for efficiency) but are generous enough to
not degrade recognition performance (e.g., by aborting "a good match").
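
The abort test can be sketched as follows, assuming the cumulative scores of one column of the
warp are already available as a list. The function names and all numeric values are hypothetical
and serve only to exercise the test; the threshold follows the formula above as printed.

```python
def threshold(i, n_frames, t_min, t_slope):
    """Threshold for test frame i, computed as (Tmin + i * Tslope) * N."""
    return (t_min + i * t_slope) * n_frames


def should_abort(column_scores, i, n_frames, t_min, t_slope):
    """Abort when the best cumulative score in column i exceeds the threshold."""
    localmin = min(column_scores)
    return localmin > threshold(i, n_frames, t_min, t_slope)


# Hypothetical cumulative scores for one column of reference frames.
scores = [120.0, 95.0, 140.0]
print(should_abort(scores, i=3, n_frames=40, t_min=1.0, t_slope=0.4))
print(should_abort(scores, i=3, n_frames=40, t_min=1.0, t_slope=0.5))
```

With these made-up numbers the tighter slope aborts the match and the more generous one lets it
continue, which is the trade-off that makes the two thresholds sensitive to tuning.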


Note that the techniques described here would have to be altered if different warping algorithms
were used. The Itakura algorithm appears particularly practical for these methods.

Restriction of the Search Space via an Adjustment Window
The dotted area indicates computational saving through the use
of the window constraint. Tolerance t is used as a measure
of the width as well as the saving achieved.

Figure 2-1: Warping Plane Indicating the Search Space of the Itakura Algorithm

Figure 2-2: Pruning Using the Preset Thresholds Tmin and Tslope

2.2  Branch  and  Bound 

In a research oriented speech recognition system it is sometimes desirable for experimentation
to ensure that recognition results are not affected by pruning mechanisms, i.e., that they are
guaranteed to reflect only the differences in the overall dissimilarity scores derived from all
matches. Nevertheless, one would want to avoid unnecessary computation. This is provided by a
method that is based on the "branch and bound" search technique. This technique requires that
the various matches of a recognition be performed in parallel.

Figure 2-3: Parallel Warping Planes

What we mean by this "parallel warping" technique is illustrated in Fig.2-3. Instead of
performing all matches sequentially, each frame i in the test token is matched with the
corresponding frames of the K reference tokens of a K-word vocabulary. Fig.2-3 illustrates this
technique by adding a dimension (k) to the warping process (usually depicted as a warping
plane). In this fashion K warping planes are considered at a time. Information about the
goodness of the matches with all the tokens in the reference vocabulary is available at all
intermediate stages ip during the warp. Several methods to prune comparatively bad matches
suggest themselves. For the "branch and bound"-based technique, however, we do not prune
away a bad match. Rather, only the so far least expensive match (the one with the so far lowest
"localmin" value) is expanded. This means that, instead of warping a particular test frame ip
against the various frames of the K reference tokens, the so far best match out of the K matches
is warped (thus proceeding in i) regardless of the momentary position in i of its search path.
This method is illustrated by Fig.2-4, which depicts the projection of the search paths onto the
ik-plane for the parallel warp and the "branch and bound"-based parallel warp. Clearly, in the
branch and bound method bad matches (i.e., matches between strongly differing speech signals)
will accumulate high distances and therefore be left behind. As soon as the best match reaches
the end of the test utterance, the recognition process is completed. Thus, implicit pruning is
performed on all other matches. This method has the advantage of guaranteeing that the lowest
dissimilarity score will be found and thus it provides identical recognition results as if no pruning
were performed. As an additional advantage for research oriented systems, it should be noted that
users can specify a value n to obtain the n best matches in the recognition, while only the least
amount of computation necessary to obtain the n best matches is performed. However, if n>>1, of
course, the computational saving will be minimal.
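
The expansion rule described above can be sketched with a priority queue keyed on each match's
current "localmin". This is a minimal sketch under stated assumptions, not the implementation
used in the system: frames are plain numbers, the local distance is the absolute difference, and
the local path constraint (predecessors j, j-1, j-2 in the previous column) is a simplified
stand-in for the Itakura constraints. The optimality argument relies on non-negative distances
and unnormalized cumulative scores, so that a match's "localmin" never decreases as it is
expanded.

```python
import heapq

INF = float("inf")


def next_column(prev, ref, x):
    """Advance one test frame: best predecessor among j, j-1, j-2 plus local distance."""
    col = []
    for j, r in enumerate(ref):
        best = min(prev[max(0, j - 2): j + 1])
        col.append(best + abs(x - r))
    return col


def branch_and_bound(test, refs):
    """Return (best reference index, its score, number of columns computed)."""
    heap = []
    for k, ref in enumerate(refs):
        # All paths are assumed to start at grid point (0, 0).
        first = [abs(test[0] - ref[0])] + [INF] * (len(ref) - 1)
        heapq.heappush(heap, (min(first), k, 0, first))
    expanded = len(refs)
    while heap:
        bound, k, i, col = heapq.heappop(heap)
        if i == len(test):
            # A finished match with the lowest bound cannot be beaten any more.
            return k, bound, expanded
        if i == len(test) - 1:
            # Last column reached: the bound becomes the score at the end point.
            heapq.heappush(heap, (col[-1], k, len(test), col))
            continue
        col = next_column(col, refs[k], test[i + 1])
        expanded += 1
        heapq.heappush(heap, (min(col), k, i + 1, col))
    raise ValueError("no reference tokens given")


# Toy data: three hypothetical reference tokens of four frames each.
test = [1, 2, 3, 4]
refs = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 1, 4, 4]]
print(branch_and_bound(test, refs))
```

In this toy run only the columns of the best-matching reference are expanded beyond the first
frame, while an exhaustive search would compute all twelve columns. Obtaining the n best matches
would amount to continuing the loop until n finished matches have been popped.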

2.3  Branch  and  Bound  with  Pruning 

In many cases, such as practical recognition systems as well as during large production runs of
research oriented recognition systems, it often does not matter to preserve the exact individual
recognition outcomes, as long as the overall number of errors is not increased when pruning is
performed. If this is the case, the branch and bound method described above can be
extended to further reduce computation time. Thus, every time a path is expanded by means of
continuing its warp, the number of frames that its search path is then leading before or lagging
behind any other path is determined. If this number exceeds the threshold Leadt, this other path
is pruned off. Leadt is given by

Leadt = (P/100) N + 1

where P is a user-defined percentage and N the number of frames in the test utterance.

Thus, using the illustration in Fig.2-4, if we were expanding path 1 to i1 and if i1-i2 > Leadt,
match 2 would be aborted. In addition to drastically decreasing the computational effort, this
pruning method is entirely independent of the numerical values of the distances, scores, and
spectral coefficients. It is therefore ideally suited for systems in a developmental stage. Using
other pruning methods, frequent changes in the representation of the speech signal would cause
the necessity for repeated retuning of thresholds to optimally trade off recognition accuracy and
computational saving.

Figure 2-4: Expanding Search Paths in Parallel Warping Algorithms
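
The lead test can be sketched as a small helper that is called each time a match has been
expanded. The function names, the dictionary of current frame positions and the numbers are
illustrative assumptions; only the case of other paths lagging behind the expanded one is
modeled.

```python
def lead_threshold(p_percent, n_frames):
    """Leadt = (P / 100) * N + 1, written so that the arithmetic stays exact here."""
    return p_percent * n_frames / 100 + 1


def prune_lagging(positions, expanded, p_percent, n_frames):
    """Return the matches that lag the just-expanded match by more than Leadt frames."""
    leadt = lead_threshold(p_percent, n_frames)
    lead = positions[expanded]
    return [k for k, i in positions.items() if lead - i > leadt]


# Hypothetical state: match 1 has just been expanded to test frame 20 of 40.
positions = {1: 20, 2: 9, 3: 15}
print(lead_threshold(15, 40))
print(prune_lagging(positions, expanded=1, p_percent=15, n_frames=40))
print(prune_lagging(positions, expanded=1, p_percent=100, n_frames=40))
```

Because the test compares frame positions only, it never looks at distance values. With P = 100
the threshold exceeds the length of the test token, so no match can be pruned and the search
reduces to plain branch and bound.
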
3. Testing

As a measure of the computation needed using the algorithms described above, we use the
total number of grid points (of the warp search space) that were computed for each speaker and
the run time. As testing conditions, the algorithms were run on 5 data sets, 36 utterances each
(the alpha-digit vocabulary) for 8 speakers (4 male, 4 female). As reference data set for each
speaker, a 36-utterance reference set was generated from 5 additional readings of the vocabulary.
A detailed description of the recognition system can be found elsewhere. It should be
pointed out, however, that entirely automatic endpoint detection was used; no manual tuning
was performed. Some of the recognition errors reported in these results are due to errors in the
endpoint detection.

The results of these experimental runs are shown in figures 3-1 through 3-6.

The computational cost of the various algorithms tested is presented in figures 3-1 and 3-2.
The criterion for these graphs was to minimize cost under the constraint of maintaining the same
or reducing error rate as compared to a conventional algorithm. The results are presented in
Fig.3-1 in terms of the number of grid points needed to compute the 180 recognitions of the test
data base and in Fig.3-2 in terms of the average run time per recognition in msec. The first measure
was chosen to provide a machine independent estimate of the savings obtained. As can be seen
from Fig.3-2 in comparison to Fig.3-1, this does not directly translate into run time
improvements as we reduce the number of grid points. This is so since, in those cases, the
number of grid points ceases to be the predominant factor contributing to computational cost
and the overhead outside the innermost warping loop has to be considered. In both graphs,
algorithm 1 (labeled "no pruning, no window") performs an exhaustive search of the Itakura warp;
algorithm 2 (no pruning, t = 5 window) is algorithm 1 with the additional adjustment window
constraint that was previously reported to yield better performance in accuracy and efficiency.
Finally, the results for algorithms 3, 4 and 5 are shown, i.e., for the branch and bound with no
pruning (i.e., P=100), the method of preset thresholds and the branch and bound method with
pruning (P = 15), as described earlier. Using the fastest algorithm, our particular implementation
of the recognition system (running on a VAX-780) operates in less than 2.5 times real time.

Figure 3-5: Average Error Rate vs. Pruning Factor P

Comparing Fig.3-1 with Fig.3-2, we also see that the run time improvements as given by the
branch and bound method with pruning are not as substantial as indicated by the saving of grid
points in Fig.3-1. This behavior is due to the larger overhead needed to perform the branch and
bound search (comparing local minima among the references). This indicates that for
vocabularies substantially larger than 40 words an alternate search strategy might result in faster
operation.

In figure 3-3 the recognition results for the branch and bound based pruned algorithm are
shown for various values of the pruning factor P. Results are plotted separately for the eight
speakers in our data base and are given in terms of error rate (percent confused). Notice that
P=100 means that no pruning is performed and the algorithm operates based on the branch and
bound technique only. From figure 3-3 it can be seen that pruning does not necessarily have to
be associated with an increase in error rate. In fact, for P=15 or 20, improvements of up to 3% or
4% can be observed for some speakers. This is true since, in many cases, two initially badly
matching utterances, such as a "B" and a "C", will be prevented from leading to confusion by
the pruning operation. Notice also (independently of the pruning and in agreement with earlier
results and other studies) the relatively high speaker dependency of the recognition results.

Figure 3-4 depicts the corresponding computational cost in terms of the number of grid points
in the search space, i.e., the number of times the innermost loop of the algorithm has to be
executed for the 180 recognitions of the testing corpus. As could be expected, the number of grid
points that need to be computed decreases monotonically with decreasing pruning factor. A
pruning factor of 15 or 20 (which yields acceptable recognition performance) will achieve a
reduction of grid points by a factor of 2 to 3. Notice also that, while the curves for the different
speakers behave similarly as a function of the pruning factor in a qualitative way, their actual
quantitative values do differ strongly across speakers. This speaker dependency in speed (up to a
factor of two) has to be considered should certain run times be required in a practical recognition
system.

To summarize these observations in a very crude way we have taken the freedom, for the
purpose of illustration, to average our results over the eight speakers, as shown in figures 3-5 and
3-6. A value of P = 20 can be seen to yield lowest error rates while a value of P=15 still leads to
equivalent performance. This suggests that enough discriminatory confidence is accumulated
when the path of an inappropriate reference candidate falls behind the best path by more than 20
percent of the length of the test token. This result shows that a search algorithm with pruning,
i.e., an algorithm that does NOT perform an exhaustive search for the optimal score, oftentimes
improves performance by virtue of imposing additional constraint on the search. This
observation is consistent with previous results concerning the optimal choice of an adjustment
window.

The cost (in terms of grid points) averaged across speakers obtained by such pruning can be
inferred from figure 3-6. Note that if one attempts to meet certain performance goals, it is better
to use the data obtained for the speaker with the highest run times, rather than the average across
speakers. For the purpose of comparison, however, we have chosen to present the data in this
way.

To obtain optimal performance data for the preset thresholds algorithm, two thresholds were
determined empirically, providing an error rate equivalent to an algorithm where no pruning was
performed, while minimizing computational cost.

4. Summary and Conclusion

We have shown that the branch and bound with pruning method is the fastest algorithm of
all the methods we have investigated. More importantly, it is insensitive to changes in parametric
representation or in vocabulary. This insensitivity to changes proves to be very beneficial in
research systems where almost all aspects of the system are changing continuously, since
optimization and tuning of thresholds is usually time consuming and cumbersome.


More specifically, the advantages are:

• This method yields 5 times faster operation for our recognition system than
performing a conventional exhaustive search. (Using a lead threshold of 15% of the
length of the test token (P = 15) as pruning factor in addition to a branch and bound
based search method (which yields approximately the same error rate as without
pruning) and using a search space window of ±50 msec.)

• Substantial cost reductions were achieved due to the insensitivity of the algorithm in
the face of system changes such as changes in parametric representation or vocabulary,
thus eliminating the need for costly retuning.

• Flexible pruning thresholds (from no pruning at all up to rigid pruning) allow one to
manually trade off efficiency and recognition performance, if so desired.

• If no pruning is performed, the algorithm reduces to the branch and bound search
guaranteeing optimality. This provides identical results as exhaustive search, while
reducing the computational cost by about 60%.

•  It  is  also  possible  to  compute  the  guaranteed  n-best  candidates  while  obtaining  more 
efficient  operation. 


The disadvantage of this technique is its higher requirements for primary memory storage,
since several matches are operated on "in parallel". For systems with insufficient local memory,
fast software implementations of such a technique and VLSI implementations might therefore be
faced more severely by the problem of performing fast I/O than by doing the computation
necessary for recognition.